Nonlinear canonical correspondence analysis and its application

The canonical correspondence analysis (CCA) is a multivariate direct gradient analysis method performing well in many fields, however, when it comes to approximating the unimodal response of species to an environmental gradient, which still assumes that the relationship between the environment and the weighted species score is linear. In this work, we propose a nonlinear canonical correspondence analysis method (NCCA), which first determines the most appropriate nonlinear explanatory factor through two screenings by correlation and LASSO regression, and successively uses the linear regression method and the improved heuristic optimal quadratic approximation method to fit the chi-square transformation values of the response variables. Thus, our method effectively reflects the nonlinear relationship between the species and the environment factors, and a biplot is employed to visualize the effects of the later on the distribution of species. The results from applying this method over a real dataset show that the NCCA method not only maintains the advantages of the polynomial canonical correspondence analysis (PCCA) proposed by Makarenkov (2002), but also outperforms Makarenkov’s method in explaining the variance of response variables.


The nonlinear canonical correspondence analysis method
Here, the Y is assumed to be a species matrix of size n × p , where n is the number of sites, p is the number of species, and the column y j of the matrix Y represents the j-th species vector, expressed as y j = (y 1j , y 2j , . . . , y nj ) T , where y ij represents the species abundance, presence/absence or other frequency data of species j observed in site i. And the sum of the row i in Y is denoted by y i+ = p j=1 y ij , the sum of the column j in Y is denoted by y +j = n i=1 y ij , and the total sum of Y is denoted by y ++ = n i=1 p j=1 y ij . Then we define the row weight p i+ = y i+ y ++ , the column weight p +j = y +j y ++ and the relative frequency p ij = y ij y ++ . With the row weights as the diagonal elements, such as D(p i+ ) = diag {p 1+ , p 2+ , . . . , p n+ } , a diagonal matrix is constructed. The X is an explanatory variables matrix of size n × m , where m is the number of the explanatory variables, the rows represent the same sites as Y, and the columns of the matrix X represent the explanatory variables observed by the sample sites. The basic steps of the LCCA were described at length in several literature 22,32,33 . The LCCA is calculated based on a matrix Y * obtained by the following chi-square transformation and a weighted matrix X t of the explanatory variable X The second step for calculating the LCCA is to calculate the regression fitting value matrix Ŷ = X t B of Y * with the multiple linear regression method which reflects a linear relationship between Y * and the weighted explanatory variable X t . If the LCCA model is not significant, we need to further consider whether there is a nonlinear relationship between Y * and X t . Even if the LCCA model is significant, we can consider whether there is a more significant nonlinear relationship between Y * and X t than the linear relationship. Our aim is to generalize this relation to a nonlinear one. The algorithm in this section is thus designed to represent the response variable Y * as some nonlinear functions of the most relevant weighted explanatory variables. In our study, the cross term of nonlinear factors is still introduced. But for avoiding the over-fitted response variables, the number of nonlinear factors in the regression should be minimized, the reason for which is if the response variable in the regression is provided by more than (n − 1) explanatory variables, the over-fitting will occur, where n is the number of observations.
The NCCA method mainly includes the following steps.
(1) Similar to the LCCA method, the matrix Y * is obtained from the matrix Y by the chi-square transformation (1). One of the differences from the LCCA is to find the nonlinear forms of environmental data significantly relating to the species distribution, and which are determined by the following process.
(1) Preliminarily calculating the nonlinear factor values and the centralization.
For some nonlinear functions f 1 , f 2 , . . . , f K , we may obtain an explanatory data matrix where f 0 (X) = X . By taking the column sum of D(p i+ )X e as the weighted average X e , the matrix X e is centered to and then is weighted with D( √ p i+ ) (2) Selecting the nonlinear environmental factors by the correlation coefficients and the significance level Given a significance level of the correlation coefficient α , we first calculate the correlation coefficients cor (i, f k (x j )) and the significance p values p (i, f k (x j )) of any column f k (x j ) in the environmental data matrix X c and any column i in the data matrix Y * of species, where i = 1, 2, . . . , p , j = 1, 2, . . . , m and k = 0, 1, 2, . . . , K . Then we select the nonlinear factors f k (x j ) with the significant correlation ( p (i, f k (x j )) < α ) and the absolute value of the correlation coefficient cor (i, f k (x j )) is greater when compared to that of the cor (i, f 0 (x j )) between the original variable x j and the species i.
As we know, the over-fitting will occur if the number of the selected environmental factors is greater than n − 1 . Therefore, further selecting the nonlinear column variables by employing the absolute values of the maximum species correlation coefficients of the top n − 1 − m from large to small is necessary.
Under no nonlinear environmental factor being selected, we need to change a new set of nonlinear functions f 1 , f 2 , . . . , f K or modify the significance level α.
If there is still no an appropriate nonlinear environmental factor selected, but several groups of different nonlinear functions are replaced repeatedly and the significance level is vary large (e.g. above 0.2), our data are not suitable for analyzing with the NCCA.
By above selection process, a matrix X expand is obtained In the following processing, the matrix X expand is similarly centered to the expression (4). 2) Estimating the distribution of species by using the nonlinear environmental factors and the LASSO regression Based on the requirements of penalized regression equation, the LASSO regression 34 is performed for each species y * j with respect to X w = D( √ p i+ )X expand . The factors in X expand , which are significantly relating to the species y * j , are selected to form a new explanatory variable matrix X y * j expand of the species y * j . At the same time, a regression coefficients vector b y * j with respect to the matrix X y * j w = D( √ p i+ )X y * j expand is obtained. Therefore, the estimated species abundance ŷ * j can be calculated as follows, By deleting the same columns and retaining the different columns in different matrices X y * j expand , we can obtain the new matrix X * In this variable screening process, we need to save the relationship between each column in X * and the original explanatory variable, so as to facilitate subsequent variable analysis.
If the number k of elements in set X y * j expand is equal to n − 1 , we skip the following step 3).

3) Considering the interaction among the explanatory variables
It is noted that the species abundance may be affected by the interaction among the explanatory variables, which is also needed to be considered. Obviously, the estimated response variable value ŷ * j as (5) does not reflect the interaction among the explanatory variables. But the difference y * j,res between the observed and the estimated response variable value reflects this interaction if it exists. By constructing the cross term matrix and using the stepwise regression method, we can obtain the cross factor estimation of the residual y * j,res . However, the large number of cross X expand = [X, non-linear functions of X strongly related to Y * ]. www.nature.com/scientificreports/ terms make it difficult to control the number of cross terms in the regression estimation expression, this method is thus not widely used in practical calculation. V.Makarenkov provided a heuristic optimal quadratic approximation method aimming at above residuals 22 . We modify the method by our requirements as follows. We consider only the cross term and not the square term in constructing the design matrix.
For this reason, for each pair of columns r and s in X y * j expand , by calculating a simple linear regression of the residual vector y * j,res on cross term x r x s , where x r , x s are two columns in X y * j expand , we obtain the regression estimation ŷ * rs j,res of the residual vector y * j,res on the variables x r and x s as follows Then, we calculate the adjusted determination coefficient R 2 adj (r, s), where k represents the number of elements of set X y * j expand , ȳ * j,res is the mean of vectors y * j,res , and y * j,res,i ( ŷ * rs j,res,i ) represents the i-th element of the vector y * j,res ( ŷ * rs j,res ). Repeating this process for each column pair (r, s) of the matrix X y * j expand , and comparing all the new determination coefficients with the original one, then we can obtain the columns pair (r, s) with the largest and better determination coefficient R 2 adj (r, s) , that is, after introducing their cross term, the adjusted determination coefficient is the largest and higher than that when they are not introduced. By combining the two selected explanatory variables x r and x s , a new join explanatory variable vector x join is formed as follows, The new combined explanatory variable x join includes the linear and cross term contributions of the variables x r and x s fitting y * j . By deleting the columns r and s and adding a new column vector x join , the number of columns of the matrix X y * j expand is one less than before. We repeat the previous process in step (2)-(3) until the matrix X y * j expand is converted to a matrix with only one column or no columns pair meeting the above conditions is found.
Finally, by calculating a linear regression of y * j on the final matrix X y * j expand , we obtain the estimated value ŷ * j of the species y * j . For each species variable y * j ( j = 1, 2, . . . , p) , repeating above steps (2) and (3) may get the regression estimations of these variables. Finally, the regression estimation matrix Ŷ is formed by Obviously, the final regression fitting value ŷ * j is a linear combination of some original environmental factors, some nonlinear functions of the original environmental factors, and some of their cross terms. Therefore, Ŷ reflects the nonlinear effect of the environmental factors on the species.
The following steps are basically the same as those of the LCCA, including the calculation of the eigenvalues and the eigenvectors of the covariance matrix SŶ TŶ =Ŷ TŶ , the calculations of the site scores, the species scores and the ordination axis scores of different scale types, which can be referred to the equations (C.6)-(C.11) of APPENDIX C in the literature 22 . But in order to reflect the nonlinear influence of the explanatory variables, the calculation methods of the explanatory variable scores are different from the LCCA.

Representing the nonlinear terms of the explanatory variables with biplots
Each column in the matrix X * of the expression (6) represents a selected explanatory variable or a nonlinear term. When drawing a biplot, we need to reflect the role of all the column variables of X * .
There are the following two approaches for calculating the explanatory variable scores. Strategy (1) If the matrix X * does not contain too many variables (the columns of X * ), the weighted correlation coefficients R Z,x of any variable x in the matrix X * and the ordination score Z of each ordination axis can be calculated by the formula (C.12) of APPENDIX C in the literature 22 . Using these weighted correlation coefficients as the coordinates of the arrows in the biplot, the explanatory variable x or the nonlinear factor can be drawn. In this way, if there are too many variables in X * , then too many arrows may present in the biplot, which can lead to the crowding. Under this case, the following second approach can be considered.
Strategy (2) The second approach is only drawing the arrows of the original explanatory variables, the position of an arrow is determined by x, and its nonlinear components. Let x be an original explanatory variable in X and f 1 (x) , f 2 (x) , . . . , f a (x) are its nonlinear components respectively corresponding to the i 1 , i 2 , . . . i a columns of X * , that is, . . , f a (x)} . The multiple correlation coefficient of the explanatory variable x and the ordination score Z of each ordination axis is calculated as follows, y * rs j,res = c rs 1 x r x s + c rs 0 .
(10) Y = (ŷ * 1 ,ŷ * 2 , · · · ,ŷ * p ). where r (u,v) is the correlation coefficient of two vectors u and v, the matrix R (Z,x) = [r (Z,x) , r (Z,f 1 (x)) , · · · , r (Z,f a (x)) ] T , R (x,x) is of the following form, The multiple correlation coefficient between the explanatory variable x and each coordinate axis is used as the coordinate of the arrow of the explanatory variable x in the biplot. Thus the arrow in the biplot reflects the nonlinear components of the explanatory variable x.
Similar to the LCCA, there are two scaling methods for drawing a biplot: scaling type 1 and scaling type 2 ( 22 , see APPENDIX C). For scaling type 1, the coordinate of explanatory variables in the biplot still needs to multiply above multiple correlation coefficient by a weight where k is the eigenvalue of the covariance matrix SŶ T , and the subscript k represents the k-th ordination axis.

Significance test in the NCCA
The permutation test is employed to examine the validity of our method. In the permutation test, a statistic is recalculated on every sampling by randomly changing the position of the tested element, so that the simulated statistical values constitute a reference distribution. Now, the position of the statistical value of the current actual data in the reference distribution can be calculated, and the p value of the actual statistical value is the probability statistical value of the position of the actual statistical value 35 .
The null hypothesis H 0 for testing the significance of the NCCA is to assume that there is no nonlinear relationship between the response variables and the explanatory variables, that is, there is no nonlinear relationship between the variables in the matrix X expand and Y * . Similar to the test process discussed by Vladimir Makarenkov and Pierre Legendre 22 , the pseudo F statistic is defined, where var (U) denotes the variance of U. The permutation test process of the NCCA is as follows.
(1) For a real data set, using the NCCA method to obtain Ŷ , and calculating with the expression (14) the pseudo F statistic. (2) Permuting the rows of matrix Y * , and getting a matrix Y * perm . (3) For the data set Y * perm and X expand (unpermuted), using the NCCA method to obtain Ŷ perm . (4) Calculating with the expression (14) the pseudo F perm statistic for the permuted data Y * perm and Ŷ perm . (5) Calculating the occurrence frequency p of event {F perm > F} and the standard deviation δ p of p in the previous N 1 samplings. (6) If the standard deviation δ p is less than a given very small threshold ǫ , stopping permutation sampling, otherwise repeating steps (2)-(5) N times (for example, N = 999).
Finally making the test conclusion. If p is less than a given significance level α (e.g., 0.01 or 0.05), rejecting the null hypothesis H 0 36 , and considering the NCCA model is significant. Otherwise, accepting the null hypothesis H 0 . If both the NCCA and other CCA models are significant, then it is necessary to determine which model is more suitable to describe and analyze the current data. Also, the permutation test can be used to evaluate the variance difference between the NCCA model and other CCA models. Now, the null hypothesis H 0 is that there is no difference between the two CCA methods. And the pseudo F statistic for this difference is The test process is consistent with the previous steps (1)- (6). The slight difference is that after each permutation, the species estimations of the two CCA models, Ŷ NCCA and Ŷ other CCA , need to be calculated.

Numerical simulation
The purpose of this section is to analyze the errors of permutation test results described in the previous section and discuss whether it can provide an effective significance testing. When there is no relationship between X and Y * , if the null hypothesis is rejected by the permutation test, the type I error occurs, or if the null hypothesis is accepted when there is a nonlinear relationship between X and Y * , the type II error occurs. We will still analyze whether this method has a good ability to distinguish between linear and nonlinear relationships. For www.nature.com/scientificreports/ this reason, we will use a large number of computer simulation data to obtain the occurrence frequencies of two types of errors of NCCA in the permutation test, so as to illustrate that the NCCA method is correctly-sized and powerful. If the occurrence rates of two types of errors are low, it indicates that the NCCA method is correctly sized and has a high power. We use the first canonical correlation coefficient to measure the extent of correlation between the generated two sets of data X and Y * . We firstly show the NCCA model has correct type I error. When data X and Y * are generated randomly and have a low first canonical correlation coefficient, there is no relationship between them with high probability. Therefore, the null hypothesis should be admitted. If it is denied, the type I error has occurred. In order to evaluate the occurrence frequency of the type I error in the permutation test, the following numerical simulation process is designed. Let . , Y p ) be a n × p matrix of the response variables and X = (X 1 .X 2 , . . . , X m ) be n × m matrix of the explanatory variables.
Step (1) Randomly generating two matrices Y and X, where Y i and X j obey the standard normal distribution ( X j ∼ N(0, 1),Y i ∼ N(0, 1) ), respectively, and shifting them to a non-negative range. Calculating the first canonical correlation coefficient of Xand Y * . If it is greater than a threshold value of the correlation coefficient, two new matrices X and Y will be generated again.
Step (2) Analyzing data sets Y and X using NCCA, respectively.
Step (3) Testing the significance of the relationship between Y and X with the permutation test at different significance levels α , and accumulating the number of times of the event in which NCCA rejects the null hypothesis, respectively, k l , k p , k n .
Step (5) Calculating the rejection rate of the null hypothesis or the frequency of the type I error, respectively, k l /N , k p /N , k n /N.
Our numerical simulation takes the paramters n = 30 , p = 4 , m = 2 , N = 1000 , and performs 499 permutation tests. Following by above steps, we obtain the occurrence frequencies of type I error through N = 1000 experiments at different significance levels α = 0.01, 0.02, 0.03, . . . , 0.20 and five thresholds of the first canonical correlation coefficients r = 0.15, 0.25, 0.361, 0.45, 0.55 and 0.6, which are shown in Table 1. Table 1 shows that the frequencies of type I error for the NCCA model are lower than the corresponding significance levels when the threshold of the first canonical correlation coefficients r ≤ 0.55 . Note that the critical value of the correlation coefficient with a degree of freedom of 28 at the significance level of 0.05 is 0.361. That is say, for two sets of randomly generated data, even if their first canonical correlation coefficient exceeds the critical value by some, the NCCA method can still be correctly sized. Therefore, the simulation shows that NCCA generally does not build a nonlinear model for two sets of unrelated data.
Next, we will discuss the type II error of NCCA. When there is a nonlinear relationship between species abundances and environmental factors, whether the NCCA can correctly find this relationship is a problem reflecting its power. The numerical simulation method is also used in this problem. By generating a large number   www.nature.com/scientificreports/ of data Y and X with some nonlinear relationship, we verify the ability of NCCA to correctly identify this relationship by permutation test. At this case, the rejecting the null hypothesis during the test indicates that NCCA has made a correct judgment. After accumulating the number of rejecting null hypothesis in all experiments, we get the rejecting frequency, that is, the frequency of NCCA correctly identifying the nonlinear relationship. The higher the frequency of rejecting null hypothesis, the stronger the power of NCCA. The basic simulation steps are similar to the above calculation of the frequency of type I error, but the difference is the first step. When generating matrix Y, it should be determined according to some nonlinear functions and noise, that is, is a nonlinear function of X, σ is the noise, which obeys a p-dimensional standard normal distribution, c is the standard deviation of noise which affects the extent of correlation between X and Y.
In order to facilitate the analysis of the next section, we take the regression equations in Table 6 as the functions that generate Y i , and the relevant parameters are also the same as in the next section, namely n = 28 , p = 12 , m = 4 . The standard deviation of noise c = 0.02 . To illustrate the advantages of the NCCA method, we also use LCCA and PCCA to analyze the data generated each time. The simulation results show that when the significance level is 0.01, the frequency of rejecting null hypothesis for the nonlinear relation using NCCA method reaches 95% in N = 1000 random experiments. When the significance level increases from 0.01 to 0.2, the corresponding frequency of rejection also further increases (see the first column on the right of Table 2). That is to say, the NCCA method has a high probability of believing that there is a certain relationship between data Y and X. Therefore, the NCCA method has a strong power to discover nonlinear relationship. However, for the data generated by the nonlinear functions, the rejection frequencies of LCCA and PCCA are also so high (see the second and third columns on the right of Table 2), which indicate that LCCA or PCCA mistakenly identifies the nonlinear relationship as a linear or polynomial relationship in most cases. Therefore, the relationship between the environment and species or between the environment and the sample site revealed by LCCA/PCCA is unreliable. If there is no NCCA method, the nonlinear relationship will be incorrectly analyzed by LCCA or PCCA. In addition, we calculated the frequencies of rejecting null hypothesis under different noise standard deviations c = 0.004 , c = 0.01 , c = 1 , c = 10,c = 30 and c = 65 (see Table 3). The results show that the rejection frequencies increase with the decrease of the noise standard deviation c.
We also use three CCA methods to analyze the data generated by the linear function Y = b 0 + b 1 X + cσ and the cubic polynomial with cross terms where σ obeys standard normal distribution, c = 0.02 . The frequencies of rejecting null hypothesis are also listed in the second-seventh columns of Table 2. It can be seen from Table 2 that when data Y is randomly generated from linear function or polynomial, the rejection frequencies of LCCA or PCCA is relatively high, which reflect their high power to find linear or polynomial relationships. However, due to the fact that the model established by NCCA includes linear and polynomial terms, the rejection frequencies of NCCA is also high, which means it can also detect this linear or polynomial relationship. According to the difference significance test method introduced in the previous section, the frequencies of rejecting null hypothesis for the difference of NCCA with LCCA and that of NCCA with PCCA all reaches 100%.  Table 4, where N(0, 1) represents the standard normal distribution, and b i obeys the uniform distribution on the interval [4,5].

Experiments
Now, we demonstrate our approach through real experiments.
The hunting spider dataset. The research on the hunting spider distribution was a part of 60-weeks extensive ecological project carried out by researchers from the department of animal ecology of Leiden University in the dune area "Meijendel" in the Hague during the period of 1969-1970. These researchers dug up 100 pitfalls with a radius of 1 m to capture spiders, and 12,156 hunting spiders from 12 species in total were captured in 60 weeks. Twenty-six environmental characteristics on the soil, the vegetation and the light around the 28 pitfall sites were measured. For the experimental data, please refer to 37 , Table I to Table IV]. In 1986, Ter Braak applied the hunting spider data set to the CCA method 1 . In 2002, Vladimir Makarenkov and Pierre Legendre also used the data set to verify the PCCA method 22 . Subsequently, other scholars also conducted various analyses over the data set. By using the principal component analysis and the canonical correlation analysis methods, Van der Aart and Smeenk-Enserink found that the number of spider species has a very strict, well defined and reproducible relationship with a major environmental factor (a principal component) 37 . They also established the quadratic regression equations between the number of each kind of spiders and the principal components ( 37 , Fig. 5 and Fig. 6). However, by the analysis of 37 , Fig. 5], the relationship between the number of spiders and the principal components is more complicated, which involves the polynomial relationships and other nonlinear relationships. Therefore, we apply the NCCA method to this hunting spider distribution data set 22,37 , then compare the results of the NCCA to that of the LCCA and the PCCA and discuss these results.
From the 26 environmental factors, the authors of the literatures 22,37 selected four environmental variables on soil, vegetation and light: the soil water content ( x 1 ), the reflection ( x 2 ), the calamagrostis coverage ( x 3 ) and the corynephorus coverage ( x 4 ). For facilitating the comparison with the PCCA, we also select the four environmental variables. Table 3. The effect of noise on the performance of three CCA methods.     www.nature.com/scientificreports/ The data set contains several nonlinear species-environment relationships. After preprocessing the original data, the correlation coefficients between the species and the corynephorus coverage or its three special nonlinear terms are shown as Table 5, from which we can see that all species except s 2 have stronger correlations with the nonlinear terms of the environmental factors when compared with the original environmental factor data. The correlation between the species s 3 and the corynephorus coverage ( x 4 ) is not significant, but the correlations between the species s 3 and three nonlinear terms is significant (the absolute values of correlation coefficient ( −0.6596 , 0.5296, 0.4463) are all greater than the critical value 0.3739). The canonical correlation coefficients between the four original environmental variables X after centralization and Y * are 0.9879, 0.9736, 0.8251, 0.6760, while the first four canonical correlation coefficients of X * obtained from the non-linear transformation of X in the following subsection become 1, 1, 1, and 0.9995. This indicates that the relationship between Y * and nonlinear environmental factors is significant, and it is appropriate to analyze them using the NCCA method. Van der Aart and Smeenk-Enserink discussed this property 37 . We will further discuss the nonlinear relationships in the data set.
The results of the NCCA . By taking five nonlinear functions x + 1 , the significance level of correlation coefficient α = 0.01 , we obtained the matrix X expand . By LASSO regression, we obtained the nonlinear regression models 0.94541 0.95262  Table 6, and the matrix X * which includes four environmental factors x 1 , x 2 , x 3 , x 4 , and 13 nonlinear environmental factors After considering the interaction among the environmental factors, the estimation Ŷ of Y * is obtained. Now, the adjust determination coefficients of the regression estimations have been generally improved, which are listed in the last column of Table 6, of which species s 2 , s 4 and s 7 have the greatest improvements.
The standard error between the estimated value of Y * based on these regression equations in Table 6 and Y * is 0.0058. According to the discussion in the Section of "Numerical simulation", the NCCA method has excellent performance under this noise intensity. The results of the NCCA based on the scaling type 2 are shown in Supplementary Table S1 online. A total of 12 canonical axes are obtained by using the NCCA method. Supplementary Table S1 shows the first six canonical axes, where the eigenvalue is the variance of each canonical axis, and the variance ratio is the ratio of the eigenvalue to the total variance of Y * (The total variance is equal to 1.92296), the cumulative variance ratio represents the cumulative value of the variance ratio. The first four canonical axes account for 75.38650% of the total variance of Y * , which is significantly larger than 43.80558% of the four canonical axes based on the LCCA, and the first two canonical axes also explain 54.80844% of the total variance. Therefore, an effective biplot ordination can be obtained in two-dimensional space.
By using the species scores, the site scores and the environmental factor scores in the columns "Axes 1" and "Axes 2" of Supplementary Table S1 , we plot the NCCA biplots as Figs. 1 and 2, according to the strategy 1 and the strategy 2, respectively. In both figures, the coordinates of the environmental factors are all magnified by 5 times. The dense area in the second quadrant of Fig. 1 or Fig.2 is drawn in in Supplementary Fig. S1 It should be noted that the position of an environmental factor in Fig. 2 is determined by the original variable and its nonlinear terms. Figure 2 is drawn according to the strategy 2, which is not the same as Fig. 1 drawn under the strategy 1. The strategy 2 mainly integrates the nonlinear terms of the environmental variables into the original environmental variables, which makes the positions of environmental factors drawn according to this strategy thus reflect the influence of the nonlinear terms of the original variable. By comparing the environmental variables x 1 -x 4 in Figs. 1 and 2, we can find that the angle and the position of x 3 (the calamagrostis coverage) are still similar, while those of x 1 (the soil water content), x 2 (the reflection) and x 4 (the corynephorus coverage) are quite different. The results indicate the nonlinear terms of the environmental variable x 3 had less impacts on a species, but that of x 1 , x 2 and x 4 had more impacts on a species. Under the action of the nonlinear terms of the environmental variables, the directions of x 1 and x 4 are almost the same, but the length of the arrow x 4 is still greater than that of the arrow x 2 , which indicates x 4 is of more importance over x 2 .
As mentioned in Section of "Significance test in the NCCA ", the significance of the NCCA model can be evaluated by performing permutation tests. Taking the standard deviation δ = 0.001 , N = 1000 and N 1 = 100 , after 227 permutations, the p-value reached 0.0044, then the test automatically stopped. Figure 3 shows the  www.nature.com/scientificreports/ probability of a species decreases with the increasing of distance from the position of the species on a biplot. In Figs. 1 and 2, the species closest to the site 26 are s 5 , s 3 , s 1 and s 8 in proper order, which is consistent with the order of the number of these species on the site 26. The species s 3 is relatively close to the site 22, the site 23, the site 24, the site 27 and the site 28, which indicates that the species s 3 mainly appears at these site points. Indeed, the number of the species s 3 is the largest at these site points (see 22 , Table 1]). This distribution characteristic is basically consistent with the result calculated by the PCCA (see 22 , Fig. 5]). Therefore, just as the PCCA, the NCCA successfully restores the arch distorted by the LCCA, which represents the gradient in the correspondence analysis biplot. However, this recovery is realized by incorporating the nonlinear terms of the environmental variables in the species interpretation equation.
The influence of the environmental factors on the distribution of species. The species points and the environmental factor arrows in the biplot reflect the distribution of the species along each environmental factor. By projecting the species points onto an environmental factor arrow by using the vertical lines, the arrangement of these projection points on the environmental factor arrow represents the impact of the environmental factor on each species. For example, when the arrow refers to x 1 (the soil water content), we can roughly divide it into three parts: the positive direction of the arrow, near the origin and the negative direction of the arrow, which respectively represent the wetness, the moderate humidity and the dryness. Therefore, we can infer which species mainly appearing in the wet places, the dry places or the places with moderate humidity values. Accordingly, we know that all the sites points can be classified into three main groups. The first group representing the driest areas includes the site points 22 − 24 , and 26 − 28 that are mainly highly positively correlated with the species s 3 and s 5 . The second group representing the wet areas is the second quadrant of Fig. 1 or Fig. 2 and these areas mainly involve the site points 2, 4 − 8 , and 13 − 21 that are of mainly high frequency correlation with the species s 4 , s 6 , s 7 and s 9 − s 12 (see Supplementary Fig. S1). The site points 1, 3, 9 − 12 and 25 represent the medium humidity areas, which is mainly related to the value of the species s 8 of high frequency. The more dispersed the arrangement of species along an environmental factor arrow, the more effective it is to distinguish the impact of environmental factors on the species distribution. In Fig. 1, the distribution of the projection points of some species along environmental factor arrows is obviously more dispersed than that of that along their original environmental factor,such as x 1 and (x 1 + 1) −3 , x 4 and log(x 4 + 1) . Therefore, we can see in Fig. 2 that the arrow positions of the environmental variables x 1 and x 4 are shifted to the top-right when compare to those in Fig. 1, which can more effectively represent the impacts of x 1 and x 4 on the species distribution.
We know that the relative importance of the impacts of different environmental factors on species can be indicated by the arrow length, that is, the longer the arrow, the greater the impact. In Fig. 1, the arrow length of the nonlinear environmental factor log( is longer than that of x 1 , which is also true for log(x 4 + 1) ( √ x 4 + 1 , (x 4 + 1) −3 ) and x 4 . Knowing from this, these nonlinear environmental factors have a greater impacts on the species distribution than their original environmental factors. The nonlinear factors of x 3 have the least impacts on the species distribution, the arrow length and positions of these nonlinear factors in Fig. 2 have thus hardly changed, while the arrow lengths of other environmental factors have been significantly lengthened.
The comparison of the NCCA with the LCCA and the PCCA . We know the three models, the LCCA, the PCCA and the NCCA are all highly significant over the spider species data according to the permutation test. If a model can explain more variances in Y * , then its performance is better. The total variance in Y * is 1.92296. Although all 12 canonical axes generated by the NCCA are unable to explain the total variances in Y * , they account together for 81.22888% of total variances. As a comparing, this ratio for the PCCA is 80.3% , while that for the LCCA is only 43.80558% . The cumulative variances ratios of the first six canonical axes of the three models are listed in Table 7.
Knowing from Table 7, the NCCA is the best one of the three models in the cumulative variance ratios of all canonical axes. For example, the first two canonical axes of the NCCA explain 54.80844% of the total variance, which is significantly higher than the 37.32036% obtained by the LCCA and the 52.21410% obtained by the PCCA. Moreover, the cumulative variance ratio of the NCCA on the first six canonical axes reaches at 79.24605%.
The permutation test introduced in Section of "Significance test in the NCCA " can be employed to evaluate the variance difference significance of the three models. After 227 permutations, the significance p value of the variance difference between the NCCA and the LCCA is 0.0044, which indicates that the NCCA is more suitable than the LCCA for addressing the spider species data, and which is also true for the case between the NCCA and the PCCA. Table 7. Cumulative variance ratios of three CCA models for the spider species data. The cumulative variance ratios of PCCA cited the data of the literature 22 .

Conclusion
In ecology, the canonical correspondence analysis is widely used for analyzing the response variables (e.g the species variables) and the explanatory variables (e.g. the environmental variables), and the regression part of the classical CCA method is linear. However, the linear relationships are impractical in most cases. From the ecological perspective, the nonlinear describing of the relationship between the response variables and the explanatory variables is more accurate and practical than the LCCA model. From this, we propose a NCCA method based on fundamental CCA theories to solve the problem of more complex nonlinear relationships between the response variables and the explanatory variables.
In solving the problem of how to describe the nonlinear relationship between the response variables and the explanatory variables, the nonlinear transformations of the original explanatory variable data matrix is first performed, next, in order to avoid model over-fitting, we use the correlation and the LASSO regression to select the most appropriate explanatory variables and their nonlinear terms twice. Subsequently, the linear regression and the improved heuristic optimal quadratic approximation method are employed to fit the response variables data.
We solve the problem of how to describe the nonlinear factors with an ordination biplot. Under no many explanatory variables and their nonlinear terms, all of which can be shown on a biplot. Otherwise, only the original explanatory variables are drawn on a biplot, in which the coordinates are the complex correlation coefficients between the explanatory variables (including their related nonlinear terms) and the ordination axis. Because the coordinate of an explanatory variable arrow is determined by the complex correlation coefficient, this coordinate reflects the influence of the nonlinear factors.
We use the permutation method to test the significance of NCCA model. Numerical simulation shows that NCCA has good performance in preventing type I error and type II error. The application of our NCCA method to a real data-set (i.e. the predatory spider distribution data-set) also shows its effectiveness. Our method inherits the advantages of the PCCA, for example, it can recover the arch representing gradient distorted by the LCCA. Furthermore, the canonical axes in our method have greatly improved the ability to explain the variance of the original response variables, and the corresponding biplot can more effectively reflect the impacts of the explanatory variables on the distributions of the response variables.
Our method also solves the problem of nonlinear assumptions between the species and the explanatory variables by adding the nonlinear terms of some explanatory variables to the matrix of the original explanatory variables. However, accurately determining the nonlinear form in actual calculations is still difficult. In the practical examples in Section "Experiments", we only calculated five special forms. With the increasing of nonlinear function forms, the computational complexity undoubtedly will greatly increase, which needs to be further studied in detail.